The Misuse of the NASA Metrics Data Program Data Sets for Automated Software Defect Prediction
Abstract
Background: The NASA Metrics Data Program data sets have been heavily used in software defect prediction experiments. Aim: To demonstrate and explain why these data sets require significant pre-processing in order to be suitable for defect prediction. Method: A meticulously documented data cleansing process involving all 13 of the original NASA data sets. Results: After our novel data cleansing process, each of the data sets had between 6 and 90 percent fewer recorded values than it originally contained. Conclusions: One: Researchers need to analyse the data that forms the basis of their findings in the context of how it will be used. Two: Defect prediction data sets could benefit from lower level code metrics in addition to those more commonly used, as these will help to distinguish modules, reducing the likelihood of repeated data points. Three: The bulk of defect prediction experiments based on the NASA Metrics Data Program data sets may have led to erroneous findings. This is mainly due to repeated data points potentially causing substantial amounts of training and testing data to be identical.
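The concern about repeated data points can be illustrated with a minimal Python sketch. It is not the cleansing process the abstract refers to, which is not described here; it only shows, on hypothetical toy data, how exact duplicate rows can end up on both sides of a train/test split and how removing them before splitting avoids that. The function names, the three toy columns, and the split ratio are all assumptions made for illustration.

```python
import random


def drop_repeated_points(rows):
    """Keep the first occurrence of each distinct row (features plus label)."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def leaked_fraction(train, test):
    """Fraction of test rows whose exact values also occur in the training set."""
    if not test:
        return 0.0
    train_keys = {tuple(r) for r in train}
    return sum(tuple(r) in train_keys for r in test) / len(test)


def split(rows, seed=0, test_ratio=0.25):
    """Shuffle with a fixed seed and cut into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy modules: (lines_of_code, cyclomatic_complexity, defective).
# Many rows are identical, mimicking modules that coarse metrics cannot tell apart.
modules = [(10, 2, 0)] * 6 + [(45, 9, 1)] * 3 + [(12, 3, 0), (80, 15, 1), (7, 1, 0)]

if __name__ == "__main__":
    train, test = split(modules)
    print("overlap before de-duplication:", leaked_fraction(train, test))

    cleaned = drop_repeated_points(modules)
    train, test = split(cleaned)
    print("rows kept:", len(cleaned), "of", len(modules))
    print("overlap after de-duplication:", leaked_fraction(train, test))
```

Once every row is distinct, no test row can be identical to a training row, so the measured overlap is zero by construction. Whether dropping duplicates is the right treatment for a given data set is a separate judgment that this sketch does not settle.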
Similar resources
New findings on the use of static code attributes for defect prediction Muhammed
Defect prediction includes tasks that are based on methods generated using software fault data sets and requires much effort to be completed. In defect prediction, although there are methods to conduct an analysis involving the classification of data sets and localisation of defects, those methods are not sufficient without eliminating repeated data points. The NASA Metrics Data Program (Nasa ...
Software Defect Prediction Based on Competitive Organization CoEvolutionary Algorithm
In order to improve the accuracy of prediction for software defect data sets, a competitive organization coevolutionary algorithm is presented and applied to software defect data. In this algorithm, a mechanism of competition is introduced into the coevolutionary algorithm. Leagues are then formed based on the importance of attributes among them. And three evolution operators whic...
Using the Support Vector Machine as a Classification Method for Software Defect Prediction with Static Code Metrics
The automated detection of defective modules within software systems could lead to reduced development costs and more reliable software. In this work the static code metrics for a collection of modules contained within eleven NASA data sets are used with a Support Vector Machine classifier. A rigorous sequence of pre-processing steps was applied to the data prior to classification, including t...
Evaluation of Classifiers in Software Fault-Proneness Prediction
The reliability of software depends on its fault-prone modules: the fewer fault-prone units the software contains, the more we may trust it. Therefore, if we are able to predict the number of fault-prone modules in a piece of software, it will be possible to judge its reliability. In predicting fault-prone software modules, one of the contributing features is the software metric, by which one ...
Software defect prediction using static code metrics : formulating a methodology
Software defect prediction is motivated by the huge costs incurred as a result of software failures. In an effort to reduce these costs, researchers have been utilising software metrics to try and build predictive models capable of locating the most defect-prone parts of a system. These areas can then be subject to some form of further analysis, such as a manual code review. It is hoped that su...
Publication date: 2011